{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lesson 2: Overview of Automated Evals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 47
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load API tokens for our 3rd party APIs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "from utils import get_circle_api_key\n",
    "cci_api_key = get_circle_api_key()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "from utils import get_gh_api_key\n",
    "gh_api_key = get_gh_api_key()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "from utils import get_openai_api_key\n",
    "openai_api_key = get_openai_api_key()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up our github branch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "from utils import get_repo_name\n",
    "course_repo = get_repo_name()\n",
    "course_repo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "from utils import get_branch\n",
    "course_branch = get_branch()\n",
    "course_branch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The sample application: AI-powered quiz generator\n",
    "We are going to build a AI powered quiz generator."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the dataset for the quizz."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "human_template  = \"{question}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 607
   },
   "outputs": [],
   "source": [
    "quiz_bank = \"\"\"1. Subject: Leonardo DaVinci\n",
    "   Categories: Art, Science\n",
    "   Facts:\n",
    "    - Painted the Mona Lisa\n",
    "    - Studied zoology, anatomy, geology, optics\n",
    "    - Designed a flying machine\n",
    "  \n",
    "2. Subject: Paris\n",
    "   Categories: Art, Geography\n",
    "   Facts:\n",
    "    - Location of the Louvre, the museum where the Mona Lisa is displayed\n",
    "    - Capital of France\n",
    "    - Most populous city in France\n",
    "    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie\n",
    "\n",
    "3. Subject: Telescopes\n",
    "   Category: Science\n",
    "   Facts:\n",
    "    - Device to observe different objects\n",
    "    - The first refracting telescopes were invented in the Netherlands in the 17th Century\n",
    "    - The James Webb space telescope is the largest telescope in space. It uses a gold-berillyum mirror\n",
    "\n",
    "4. Subject: Starry Night\n",
    "   Category: Art\n",
    "   Facts:\n",
    "    - Painted by Vincent van Gogh in 1889\n",
    "    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence\n",
    "\n",
    "5. Subject: Physics\n",
    "   Category: Science\n",
    "   Facts:\n",
    "    - The sun doesn't change color during sunset.\n",
    "    - Water slows the speed of light\n",
    "    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build the prompt template."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 539
   },
   "outputs": [],
   "source": [
    "delimiter = \"####\"\n",
    "\n",
    "prompt_template = f\"\"\"\n",
    "Follow these steps to generate a customized quiz for the user.\n",
    "The question will be delimited with four hashtags i.e {delimiter}\n",
    "\n",
    "The user will provide a category that they want to create a quiz for. Any questions included in the quiz\n",
    "should only refer to the category.\n",
    "\n",
    "Step 1:{delimiter} First identify the category user is asking about from the following list:\n",
    "* Geography\n",
    "* Science\n",
    "* Art\n",
    "\n",
    "Step 2:{delimiter} Determine the subjects to generate questions about. The list of topics are below:\n",
    "\n",
    "{quiz_bank}\n",
    "\n",
    "Pick up to two subjects that fit the user's category. \n",
    "\n",
    "Step 3:{delimiter} Generate a quiz for the user. Based on the selected subjects generate 3 questions for the user using the facts about the subject.\n",
    "\n",
    "Use the following format for the quiz:\n",
    "Question 1:{delimiter} <question 1>\n",
    "\n",
    "Question 2:{delimiter} <question 2>\n",
    "\n",
    "Question 3:{delimiter} <question 3>\n",
    "\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use langchain to build the prompt template."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "from langchain.prompts import ChatPromptTemplate\n",
    "chat_prompt = ChatPromptTemplate.from_messages([(\"human\", prompt_template)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "# print to observe the content or generated object\n",
    "chat_prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Choose the LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "llm = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)\n",
    "llm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set up an output parser in LangChain that converts the llm response into a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 80
   },
   "outputs": [],
   "source": [
    "# parser\n",
    "from langchain.schema.output_parser import StrOutputParser\n",
    "output_parser = StrOutputParser()\n",
    "output_parser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Connect the pieces using the pipe operator from Langchain Expression Language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "chain = chat_prompt | llm | output_parser\n",
    "chain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build the function 'assistance_chain' to put together all steps above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 216
   },
   "outputs": [],
   "source": [
    "# taking all components and making reusable as one piece\n",
    "def assistant_chain(\n",
    "    system_message,\n",
    "    human_template=\"{question}\",\n",
    "    llm=ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0),\n",
    "    output_parser=StrOutputParser()):\n",
    "  \n",
    "  chat_prompt = ChatPromptTemplate.from_messages([\n",
    "      (\"system\", system_message),\n",
    "      (\"human\", human_template),\n",
    "  ])\n",
    "  return chat_prompt | llm | output_parser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the function 'eval_expected_words' for the first example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 403
   },
   "outputs": [],
   "source": [
    "def eval_expected_words(\n",
    "    system_message,\n",
    "    question,\n",
    "    expected_words,\n",
    "    human_template=\"{question}\",\n",
    "    llm=ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0),\n",
    "    output_parser=StrOutputParser()):\n",
    "    \n",
    "  assistant = assistant_chain(\n",
    "      system_message,\n",
    "      human_template,\n",
    "      llm,\n",
    "      output_parser)\n",
    "    \n",
    "  \n",
    "  answer = assistant.invoke({\"question\": question})\n",
    "    \n",
    "  print(answer)\n",
    "    \n",
    "  assert any(word in answer.lower() \\\n",
    "             for word in expected_words), \\\n",
    "    f\"Expected the assistant questions to include \\\n",
    "    '{expected_words}', but it did not\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Test: Generate a quiz about science."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "question  = \"Generate a quiz about science.\"\n",
    "expected_words = [\"davinci\", \"telescope\", \"physics\", \"curie\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the eval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 97
   },
   "outputs": [],
   "source": [
    "eval_expected_words(\n",
    "    prompt_template,\n",
    "    question,\n",
    "    expected_words\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the function 'evaluate_refusal' to define a failing test case where the app should decline to answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 335
   },
   "outputs": [],
   "source": [
    "def evaluate_refusal(\n",
    "    system_message,\n",
    "    question,\n",
    "    decline_response,\n",
    "    human_template=\"{question}\", \n",
    "    llm=ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0),\n",
    "    output_parser=StrOutputParser()):\n",
    "    \n",
    "  assistant = assistant_chain(human_template, \n",
    "                              system_message,\n",
    "                              llm,\n",
    "                              output_parser)\n",
    "  \n",
    "  answer = assistant.invoke({\"question\": question})\n",
    "  print(answer)\n",
    "  \n",
    "  assert decline_response.lower() in answer.lower(), \\\n",
    "    f\"Expected the bot to decline with \\\n",
    "    '{decline_response}' got {answer}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a new question (which should be a bad request)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "question  = \"Generate a quiz about Rome.\"\n",
    "decline_response = \"I'm sorry\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the refusal eval."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<p style=\"background-color:pink; padding:15px;\"> <b>Note:</b> The following function call will throw an exception.</p>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 97
   },
   "outputs": [],
   "source": [
    "evaluate_refusal(\n",
    "    prompt_template,\n",
    "    question,\n",
    "    decline_response\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running evaluations in a CircleCI pipeline\n",
    "\n",
    "Put all these steps together into files to reuse later.\n",
    "\n",
    "**_Note:_** fixing the system_message by adding additional rules:\n",
    "\n",
    "- Only use explicit matches for the category, if the category is not an exact match to categories in the quiz bank, answer that you do not have information.\n",
    "- If the user asks a question about a subject you do not have information about in the quiz bank, answer \"I'm sorry I do not have information about that\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 1593
   },
   "outputs": [],
   "source": [
    "%%writefile app.py\n",
    "from langchain.prompts                import ChatPromptTemplate\n",
    "from langchain.chat_models            import ChatOpenAI\n",
    "from langchain.schema.output_parser   import StrOutputParser\n",
    "\n",
    "delimiter = \"####\"\n",
    "\n",
    "quiz_bank = \"\"\"1. Subject: Leonardo DaVinci\n",
    "   Categories: Art, Science\n",
    "   Facts:\n",
    "    - Painted the Mona Lisa\n",
    "    - Studied zoology, anatomy, geology, optics\n",
    "    - Designed a flying machine\n",
    "  \n",
    "2. Subject: Paris\n",
    "   Categories: Art, Geography\n",
    "   Facts:\n",
    "    - Location of the Louvre, the museum where the Mona Lisa is displayed\n",
    "    - Capital of France\n",
    "    - Most populous city in France\n",
    "    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie\n",
    "\n",
    "3. Subject: Telescopes\n",
    "   Category: Science\n",
    "   Facts:\n",
    "    - Device to observe different objects\n",
    "    - The first refracting telescopes were invented in the Netherlands in the 17th Century\n",
    "    - The James Webb space telescope is the largest telescope in space. It uses a gold-berillyum mirror\n",
    "\n",
    "4. Subject: Starry Night\n",
    "   Category: Art\n",
    "   Facts:\n",
    "    - Painted by Vincent van Gogh in 1889\n",
    "    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence\n",
    "\n",
    "5. Subject: Physics\n",
    "   Category: Science\n",
    "   Facts:\n",
    "    - The sun doesn't change color during sunset.\n",
    "    - Water slows the speed of light\n",
    "    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.\n",
    "\"\"\"\n",
    "\n",
    "system_message = f\"\"\"\n",
    "Follow these steps to generate a customized quiz for the user.\n",
    "The question will be delimited with four hashtags i.e {delimiter}\n",
    "\n",
    "The user will provide a category that they want to create a quiz for. Any questions included in the quiz\n",
    "should only refer to the category.\n",
    "\n",
    "Step 1:{delimiter} First identify the category user is asking about from the following list:\n",
    "* Geography\n",
    "* Science\n",
    "* Art\n",
    "\n",
    "Step 2:{delimiter} Determine the subjects to generate questions about. The list of topics are below:\n",
    "\n",
    "{quiz_bank}\n",
    "\n",
    "Pick up to two subjects that fit the user's category. \n",
    "\n",
    "Step 3:{delimiter} Generate a quiz for the user. Based on the selected subjects generate 3 questions for the user using the facts about the subject.\n",
    "\n",
    "Use the following format for the quiz:\n",
    "Question 1:{delimiter} <question 1>\n",
    "\n",
    "Question 2:{delimiter} <question 2>\n",
    "\n",
    "Question 3:{delimiter} <question 3>\n",
    "\n",
    "Additional rules:\n",
    "\n",
    "- Only use explicit matches for the category, if the category is not an exact match to categories in the quiz bank, answer that you do not have information.\n",
    "- If the user asks a question about a subject you do not have information about in the quiz bank, answer \"I'm sorry I do not have information about that\".\n",
    "\"\"\"\n",
    "\n",
    "\"\"\"\n",
    "  Helper functions for writing the test cases\n",
    "\"\"\"\n",
    "\n",
    "def assistant_chain(\n",
    "    system_message=system_message,\n",
    "    human_template=\"{question}\",\n",
    "    llm=ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0),\n",
    "    output_parser=StrOutputParser()):\n",
    "\n",
    "  chat_prompt = ChatPromptTemplate.from_messages([\n",
    "      (\"system\", system_message),\n",
    "      (\"human\", human_template),\n",
    "  ])\n",
    "  return chat_prompt | llm | output_parser\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Command to see the content:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "!cat app.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create new file to include the evals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 1338
   },
   "outputs": [],
   "source": [
    "%%writefile test_assistant.py\n",
    "from app import assistant_chain\n",
    "from app import system_message\n",
    "from langchain.prompts                import ChatPromptTemplate\n",
    "from langchain.chat_models            import ChatOpenAI\n",
    "from langchain.schema.output_parser   import StrOutputParser\n",
    "\n",
    "import os\n",
    "\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "_ = load_dotenv(find_dotenv())\n",
    "\n",
    "def eval_expected_words(\n",
    "    system_message,\n",
    "    question,\n",
    "    expected_words,\n",
    "    human_template=\"{question}\",\n",
    "    llm=ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0),\n",
    "    output_parser=StrOutputParser()):\n",
    "\n",
    "  assistant = assistant_chain(system_message)\n",
    "  answer = assistant.invoke({\"question\": question})\n",
    "  print(answer)\n",
    "    \n",
    "  assert any(word in answer.lower() \\\n",
    "             for word in expected_words), \\\n",
    "    f\"Expected the assistant questions to include \\\n",
    "    '{expected_words}', but it did not\"\n",
    "\n",
    "def evaluate_refusal(\n",
    "    system_message,\n",
    "    question,\n",
    "    decline_response,\n",
    "    human_template=\"{question}\", \n",
    "    llm=ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0),\n",
    "    output_parser=StrOutputParser()):\n",
    "    \n",
    "  assistant = assistant_chain(human_template, \n",
    "                              system_message,\n",
    "                              llm,\n",
    "                              output_parser)\n",
    "  \n",
    "  answer = assistant.invoke({\"question\": question})\n",
    "  print(answer)\n",
    "  \n",
    "  assert decline_response.lower() in answer.lower(), \\\n",
    "    f\"Expected the bot to decline with \\\n",
    "    '{decline_response}' got {answer}\"\n",
    "\n",
    "\"\"\"\n",
    "  Test cases\n",
    "\"\"\"\n",
    "\n",
    "def test_science_quiz():\n",
    "  \n",
    "  question  = \"Generate a quiz about science.\"\n",
    "  expected_subjects = [\"davinci\", \"telescope\", \"physics\", \"curie\"]\n",
    "  eval_expected_words(\n",
    "      system_message,\n",
    "      question,\n",
    "      expected_subjects)\n",
    "\n",
    "def test_geography_quiz():\n",
    "  question  = \"Generate a quiz about geography.\"\n",
    "  expected_subjects = [\"paris\", \"france\", \"louvre\"]\n",
    "  eval_expected_words(\n",
    "      system_message,\n",
    "      question,\n",
    "      expected_subjects)\n",
    "\n",
    "def test_refusal_rome():\n",
    "  question  = \"Help me create a quiz about Rome\"\n",
    "  decline_response = \"I'm sorry\"\n",
    "  evaluate_refusal(\n",
    "      system_message,\n",
    "      question,\n",
    "      decline_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Command to see the content of the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "!cat test_assistant.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The CircleCI config file\n",
    "Now let's set up our tests to run automatically in CircleCI.\n",
    "\n",
    "For this course, we've created a working CircleCI config file. Let's take a look at the configuration.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "!cat circle_config.yml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run the per-commit evals\n",
    "Push files into the github repo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "from utils import push_files\n",
    "push_files(course_repo, course_branch, [\"app.py\", \"test_assistant.py\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Trigger the pipeline in CircleCI pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "from utils import trigger_commit_evals\n",
    "trigger_commit_evals(course_repo, course_branch, cci_api_key)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
